Coffee Shop Sales Analysis¶

Mini data project -exploring sales trends, product performance and time patterns.

***Author:** Sabína Mikušovičová Tools: Python (Pandas, Seaborn, Matplotlib

🎯 Objective¶

The goal of this project is to explore and visualize sales data from a coffee shop in order to identify:

  • Top-selling coffee types
  • Sales trends over time (years, months, hours)
  • Busy hours and days of the week
  • Average price evolution and total revenue growth

🧰 Tools & Libraries¶

  • Python: pandas, numpy, matplotlib, seaborn
  • Environment: JupyterLab
  • Data visualization: Heatmaps, bar charts, line plots
  • Dataset: Coffee shop transaction data (anonymized)

📂 Notebook Structure¶

  1. Data Loading & Preparation
  2. Exploratory Data Analysis (EDA)
  3. Time-based Trends
  4. Product Sales Analysis
  5. Insights & Conclusions

Data Loading and Preparation¶

In [52]:
import pandas as pd
import datetime as dt
import numpy as np
In [53]:
df = (
    pd.read_csv('Coffe_sales.csv')
)
df.head()
Out[53]:
hour_of_day cash_type money coffee_name Time_of_Day Weekday Month_name Weekdaysort Monthsort Date Time
0 10 card 38.7 Latte Morning Fri Mar 5 3 2024-03-01 10:15:50.520000
1 12 card 38.7 Hot Chocolate Afternoon Fri Mar 5 3 2024-03-01 12:19:22.539000
2 12 card 38.7 Hot Chocolate Afternoon Fri Mar 5 3 2024-03-01 12:20:18.089000
3 13 card 28.9 Americano Afternoon Fri Mar 5 3 2024-03-01 13:46:33.006000
4 13 card 38.7 Latte Afternoon Fri Mar 5 3 2024-03-01 13:48:14.626000

Exploratory Data Analysis¶

In [55]:
df.head()
Out[55]:
hour_of_day cash_type money coffee_name Time_of_Day Weekday Month_name Weekdaysort Monthsort Date Time
0 10 card 38.7 Latte Morning Fri Mar 5 3 2024-03-01 10:15:50.520000
1 12 card 38.7 Hot Chocolate Afternoon Fri Mar 5 3 2024-03-01 12:19:22.539000
2 12 card 38.7 Hot Chocolate Afternoon Fri Mar 5 3 2024-03-01 12:20:18.089000
3 13 card 28.9 Americano Afternoon Fri Mar 5 3 2024-03-01 13:46:33.006000
4 13 card 38.7 Latte Afternoon Fri Mar 5 3 2024-03-01 13:48:14.626000
In [56]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3547 entries, 0 to 3546
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   hour_of_day  3547 non-null   int64  
 1   cash_type    3547 non-null   object 
 2   money        3547 non-null   float64
 3   coffee_name  3547 non-null   object 
 4   Time_of_Day  3547 non-null   object 
 5   Weekday      3547 non-null   object 
 6   Month_name   3547 non-null   object 
 7   Weekdaysort  3547 non-null   int64  
 8   Monthsort    3547 non-null   int64  
 9   Date         3547 non-null   object 
 10  Time         3547 non-null   object 
dtypes: float64(1), int64(3), object(7)
memory usage: 304.9+ KB
In [57]:
df['Date']=pd.to_datetime(df['Date'])
df['Time']=pd.to_datetime(df['Time'])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3547 entries, 0 to 3546
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   hour_of_day  3547 non-null   int64         
 1   cash_type    3547 non-null   object        
 2   money        3547 non-null   float64       
 3   coffee_name  3547 non-null   object        
 4   Time_of_Day  3547 non-null   object        
 5   Weekday      3547 non-null   object        
 6   Month_name   3547 non-null   object        
 7   Weekdaysort  3547 non-null   int64         
 8   Monthsort    3547 non-null   int64         
 9   Date         3547 non-null   datetime64[ns]
 10  Time         3547 non-null   datetime64[ns]
dtypes: datetime64[ns](2), float64(1), int64(3), object(5)
memory usage: 304.9+ KB
C:\Users\NB2\AppData\Local\Temp\ipykernel_11224\34177654.py:2: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
  df['Time']=pd.to_datetime(df['Time'])
In [58]:
df['coffee_name'].nunique()
Out[58]:
8

Product Sales Analysis¶

In [60]:
df["year"]=df['Date'].dt.year

df_sales=(
    df.pivot_table(
        index=["coffee_name","year",'Month_name'],
        aggfunc={'money':['mean','sum']})
     .droplevel(0, axis='columns')
    .rename({'mean': 'avg_price', 'sum': 'total_units_sold'}, axis='columns')
)
df_sales.sample(10)
Out[60]:
avg_price total_units_sold
coffee_name year Month_name
Cappuccino 2024 Sep 33.178537 1360.32
Cortado 2024 Mar 28.900000 809.20
Espresso 2025 Mar 21.060000 210.60
Americano 2025 Jan 25.960000 649.00
2024 Nov 25.960000 649.00
Cappuccino 2024 Dec 35.760000 1358.88
May 37.720000 1961.44
Cortado 2024 May 27.920000 474.64
Cocoa 2024 Mar 38.700000 232.20
Cortado 2025 Jan 25.960000 571.12
In [61]:
df_total=(
    df_sales
    .assign(sales=lambda x:x['avg_price'] *x ['total_units_sold']).round(2)
    .sort_values(['total_units_sold'],ascending=False)
    .reset_index()
    
)
df_total
Out[61]:
coffee_name year Month_name avg_price total_units_sold sales
0 Latte 2024 Oct 35.76 4291.20 153453.31
1 Latte 2024 Sep 33.13 3114.48 103191.34
2 Americano 2025 Feb 25.96 3037.32 78848.83
3 Americano with Milk 2024 Sep 28.17 2930.14 82555.00
4 Americano 2025 Mar 25.96 2647.92 68740.00
... ... ... ... ... ... ...
99 Cocoa 2024 Jun 37.72 150.88 5691.19
100 Espresso 2025 Jan 21.06 105.30 2217.62
101 Espresso 2024 Apr 24.00 96.00 2304.00
102 Espresso 2024 Nov 21.06 63.18 1330.57
103 Cortado 2025 Mar 25.96 51.92 1347.84

104 rows × 6 columns

In [62]:
filtered=(
    df_total
    .groupby('coffee_name')['sales']
    .sum()
    .sort_values(ascending=False)
    .head(3)
    .index.tolist()
)
filtered
Out[62]:
['Latte', 'Americano with Milk', 'Cappuccino']
In [63]:
df_top3=df_total.query("coffee_name in @filtered")
df_top3.sample(7)
Out[63]:
coffee_name year Month_name avg_price total_units_sold sales
25 Americano with Milk 2025 Jan 30.86 1604.72 49521.66
0 Latte 2024 Oct 35.76 4291.20 153453.31
22 Americano with Milk 2024 Dec 30.86 1759.02 54283.36
18 Americano with Milk 2024 Jul 28.67 1863.80 53442.31
27 Americano with Milk 2025 Mar 30.86 1604.72 49521.66
24 Latte 2024 Dec 35.76 1680.72 60102.55
1 Latte 2024 Sep 33.13 3114.48 103191.34
In [64]:
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
In [65]:
fig_units=(
    px.bar(
        df_top3,
        x="Month_name", 
        y="total_units_sold", 
        color='coffee_name',
        barmode='group',
        facet_col='year',
        category_orders={
           'Month_name': ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'],
            'year':[2024,2025]}
    )
    .update_layout(
        title='Monthly Trends in Units Sold of TOP 3 products (March 2024- March 2025)',
        title_x=0.5,
        xaxis_title=' ',
        yaxis_title=' '
    )
)

fig_units.show()
  • The graph shows the monthly number of coffees sold between March 2024 and March 2025. Sales gradually increase until autumn, peaking between September and November, before declining slightly over the winter months.
In [67]:
fig_sales=(
    px.bar(
        df_top3,
        x="Month_name", 
        y="sales", 
        color='coffee_name',
        barmode='group',
        facet_col='year',
        category_orders={
           'Month_name': ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'],
            'year':[2024,2025]}
    )
    .update_layout(
        title='Monthly Sales Performance of TOP 3 products (March 2024- March 2025)',
        title_x=0.5,
        xaxis_title=' ',
        yaxis_title=' '
    )
)

fig_sales.show()
  • This graph illustrates the café´s total sales over the same period in 2 years. The trend closely follows the unit sold, indicating a strong positive correlation between the quantity sold and overall revenue.

Coffee products performance from March 2024 to March 2025¶

The analysis revelas that:

  • The coffee_shop best season is from September to November
  • The Q1 of 2025 outperformed the Q1 of 2024
  • Higher sales volumes correspond to higher total revenue
In [70]:
sns.relplot(
    data=df_total,
    x='year',y ='avg_price',
    kind="line",
    hue="coffee_name",
    marker="o",
    height=3, aspect=1.5
)
plt.title("Coffee Price Trends (March 2024 - March 2025)")
plt.xlabel('')
plt.ylabel('')

plt.show()
No description has been provided for this image

The graph shows that the prices remained stable throughout the period, showing neither a noticeable increase nor a decrease.

Time-Based Sales and Traffic Analysis¶

In [73]:
df.head()
Out[73]:
hour_of_day cash_type money coffee_name Time_of_Day Weekday Month_name Weekdaysort Monthsort Date Time year
0 10 card 38.7 Latte Morning Fri Mar 5 3 2024-03-01 2025-10-14 10:15:50.520 2024
1 12 card 38.7 Hot Chocolate Afternoon Fri Mar 5 3 2024-03-01 2025-10-14 12:19:22.539 2024
2 12 card 38.7 Hot Chocolate Afternoon Fri Mar 5 3 2024-03-01 2025-10-14 12:20:18.089 2024
3 13 card 28.9 Americano Afternoon Fri Mar 5 3 2024-03-01 2025-10-14 13:46:33.006 2024
4 13 card 38.7 Latte Afternoon Fri Mar 5 3 2024-03-01 2025-10-14 13:48:14.626 2024
In [74]:
Weekday_order=['Mon','Tue','Wed', 'Thu','Fri', 'Sat', 'Sun']
df['Weekday']=pd.Categorical(df['Weekday'],categories=Weekday_order,ordered=True)
In [75]:
Time_of_day_order=['Morning','Afternoon','Night']
df['Time_of_Day']=pd.Categorical(df['Time_of_Day'],categories=Time_of_day_order,ordered=True)
In [76]:
pivot_days=(
    df.pivot_table(
        index=["Weekday",'Time_of_Day'],
        aggfunc={'money':['sum']})
     .droplevel(1, axis='columns')
    .rename({ 'sum': 'total_units_sold'}, axis='columns')
    .unstack()
    .droplevel(0, axis='columns')

)
pivot_days
C:\Users\NB2\AppData\Local\Temp\ipykernel_11224\2416743940.py:2: FutureWarning:

The default value of observed=False is deprecated and will change to observed=True in a future version of pandas. Specify observed=False to silence this warning and retain the current behavior

Out[76]:
Time_of_Day Morning Afternoon Night
Weekday
Mon 5987.34 5664.10 5711.66
Tue 6343.92 5012.08 6812.38
Wed 5071.32 5147.76 5531.38
Thu 4419.32 5203.58 6468.50
Fri 5839.36 5504.90 5458.40
Sat 4570.62 6307.30 3855.60
Sun 3697.32 5290.32 4348.42
In [77]:
pivot_days_total=pivot_days.copy()
pivot_days_total['Total']=pivot_days.sum(axis=1)
pivot_days_total
Out[77]:
Time_of_Day Morning Afternoon Night Total
Weekday
Mon 5987.34 5664.10 5711.66 17363.10
Tue 6343.92 5012.08 6812.38 18168.38
Wed 5071.32 5147.76 5531.38 15750.46
Thu 4419.32 5203.58 6468.50 16091.40
Fri 5839.36 5504.90 5458.40 16802.66
Sat 4570.62 6307.30 3855.60 14733.52
Sun 3697.32 5290.32 4348.42 13336.06
In [78]:
plt.figure(figsize=(10,6))
sns.heatmap(pivot_days, annot=True, fmt='.0f',cmap='YlGnBu')
plt.title( 'Product Sales Heatmap by Time of Day')
plt.xlabel(' ')
plt.ylabel(' ')
plt.show()
           
No description has been provided for this image
  • The heatmap shows the café´s sales distribution across days of week and times of day, The results reveal clear peak periods, with the highest activity observed during late mornings and early afternoons, particularly on Fridays and weekends.
In [80]:
df_coffee=(
    df.pivot_table(
        index=["coffee_name","Time_of_Day"],
        aggfunc={'money':['sum']})
     .droplevel(1, axis='columns')
    .rename({'sum':'total_units_sold'}, axis='columns')
    .unstack()
    .droplevel(0,axis='columns')
)
df_coffee
C:\Users\NB2\AppData\Local\Temp\ipykernel_11224\3694376587.py:2: FutureWarning:

The default value of observed=False is deprecated and will change to observed=True in a future version of pandas. Specify observed=False to silence this warning and retain the current behavior

Out[80]:
Time_of_Day Morning Afternoon Night
coffee_name
Americano 5643.10 6133.94 2873.22
Americano with Milk 10025.52 7384.36 7341.24
Cappuccino 4327.44 5910.70 7201.00
Cocoa 2059.38 2685.92 3775.86
Cortado 3605.46 2315.84 1463.56
Espresso 873.72 1189.16 627.40
Hot Chocolate 1744.40 2899.02 5290.04
Latte 7650.18 9611.10 9614.02
In [81]:
pivot_coffee=df.pivot_table(values='money',
        index='coffee_name',
         columns='Time_of_Day',
         aggfunc='sum')
pivot_coffee['Total']=pivot_coffee.sum(axis=1)
pivot_coffee
C:\Users\NB2\AppData\Local\Temp\ipykernel_11224\2230699132.py:1: FutureWarning:

The default value of observed=False is deprecated and will change to observed=True in a future version of pandas. Specify observed=False to silence this warning and retain the current behavior

Out[81]:
Time_of_Day Morning Afternoon Night Total
coffee_name
Americano 5643.10 6133.94 2873.22 14650.26
Americano with Milk 10025.52 7384.36 7341.24 24751.12
Cappuccino 4327.44 5910.70 7201.00 17439.14
Cocoa 2059.38 2685.92 3775.86 8521.16
Cortado 3605.46 2315.84 1463.56 7384.86
Espresso 873.72 1189.16 627.40 2690.28
Hot Chocolate 1744.40 2899.02 5290.04 9933.46
Latte 7650.18 9611.10 9614.02 26875.30
In [82]:
plt.figure(figsize=(10,6))
sns.heatmap(df_coffee, annot=True, fmt='.0f',cmap='YlGnBu')
plt.title( ' Coffee Sales by Time of Day')
plt.xlabel('')
plt.ylabel('')
plt.show()
No description has been provided for this image
  • The heatmap illustrates when the highest number of coffees are sold during a day

Time-based coffee sales overview¶

The analysis of sales performance by day and dime reveals, that:

  • In the mornings, customers prefer lattes and americanos with milk
  • Cappuccinos perform better in the afternoon
  • Espressos have the lowest performance throughout the day
In [85]:
df_hour=(
    df
    .pivot_table(index='hour_of_day', aggfunc={'money': 'sum'})
    .sort_values(by='hour_of_day',ascending=True)
    .reset_index()
)
df_hour.head()
Out[85]:
hour_of_day money
0 6 149.40
1 7 2846.02
2 8 7017.88
3 9 7264.28
4 10 10198.52
In [86]:
fig_hour =(
    px.line(df_hour, x="hour_of_day", y="money")
         .update_layout(
             title='Hourly Coffee Sales',
             title_x=0.5,
             xaxis_title='Hour of Day',                     
            yaxis_title='',
             template = 'plotly_white'
         )
)

fig_hour.show()

⏰ Peak Hours Analysis¶

The analysis of hourly sales reveals that:

  • The busiest hours are 8:00–10:00 AM, which corresponds to the morning coffee rush.
  • A smaller peak is also visible around 3:00–4:00 PM, likely corresponding to afternoon breaks.
  • Sales drop significantly after 6:00 PM, indicating low evening demand.

🧠 Conclusions & Insights¶

  • ☕ Latte was the most popular product across all years.
  • 💡 Prices of coffes stay stable.
  • 📈 Sales increased in Q1 2025 despite prices stability suggests higher customer volume.
  • ⏰ The busiest hours were between 8:00 and 10:00 AM.
  • 🗓️ Weekdays showed higher activity than weekends, consistent with office-hour coffee demand.

These information can be used to optimize staffing schedules or promotional campaings during off-peak hours.

In [89]:
from IPython.display import HTML

HTML(
'''
<script>
    code_show = true;

    function code_toggle() {
        if (code_show) {
            document.querySelectorAll('.jp-CodeCell .jp-InputArea').forEach(function(el) {
               el.style.display = 'none';
            });
        } else {
            document.querySelectorAll('.jp-CodeCell .jp-InputArea').forEach(function(el) {
               el.style.display = 'block';
            });
        }
        code_show = !code_show
    }
</script>
<form action="javascript:code_toggle()">
    <input type="submit" value="Show / Hide code">
</form>
'''
)
Out[89]:

Analysis of factors impacting the sales performance (Decision Tree Model)¶

Objective:¶

  • Identification of features, which have the biggest impact on the units sold in the coffee shop

Model used: DecisionTreeRegressor¶

In [93]:
from sklearn.model_selection  import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
In [126]:
df.sample(10)
Out[126]:
hour_of_day cash_type money coffee_name Time_of_Day Weekday Month_name Weekdaysort Monthsort Date Time year
841 19 card 27.92 Americano with Milk Night Sat Jul 6 7 2024-07-06 2025-10-14 19:10:39.580 2024
1453 12 card 23.02 Cortado Afternoon Thu Sep 4 9 2024-09-12 2025-10-14 12:32:22.981 2024
1207 21 card 27.92 Americano with Milk Night Thu Aug 4 8 2024-08-15 2025-10-14 21:36:11.033 2024
3340 15 card 35.76 Cappuccino Afternoon Sun Mar 7 3 2025-03-09 2025-10-14 15:12:41.404 2025
1047 11 card 27.92 Americano with Milk Morning Thu Aug 4 8 2024-08-01 2025-10-14 11:22:46.369 2024
1448 10 card 18.12 Espresso Morning Thu Sep 4 9 2024-09-12 2025-10-14 10:57:25.619 2024
1537 22 card 27.92 Americano with Milk Night Sat Sep 6 9 2024-09-21 2025-10-14 22:18:46.088 2024
2683 7 card 35.76 Latte Morning Tue Jan 2 1 2025-01-14 2025-10-14 07:37:42.461 2025
3494 12 card 25.96 Americano Afternoon Thu Mar 4 3 2025-03-20 2025-10-14 12:48:24.389 2025
3008 17 card 30.86 Americano with Milk Night Fri Feb 5 2 2025-02-14 2025-10-14 17:25:24.871 2025
In [182]:
df_2=(
    df.pivot_table(
        index=['Month_name','Weekday',"coffee_name",'Time_of_Day',],
        aggfunc={'money':['mean','sum']})
     .droplevel(0, axis='columns')
    .rename({'mean': 'avg_price', 'sum': 'total_units_sold'}, axis='columns')

)
df_2.sample(10)
C:\Users\NB2\AppData\Local\Temp\ipykernel_11224\2766130499.py:2: FutureWarning:

The default value of observed=False is deprecated and will change to observed=True in a future version of pandas. Specify observed=False to silence this warning and retain the current behavior

Out[182]:
avg_price total_units_sold
Month_name Weekday coffee_name Time_of_Day
Dec Tue Hot Chocolate Night 35.76 143.04
Sun Hot Chocolate Afternoon NaN 0.00
Feb Fri Americano with Milk Morning 30.86 185.16
Mon Cocoa Afternoon 35.76 214.56
Apr Thu Cortado Night NaN 0.00
Feb Sat Americano with Milk Morning 30.86 92.58
Jun Fri Hot Chocolate Morning NaN 0.00
Nov Wed Latte Afternoon 35.76 107.28
Jul Thu Espresso Afternoon NaN 0.00
Dec Sat Cortado Morning 25.96 51.92
In [186]:
df_2=df_2.reset_index()
df_2
Out[186]:
index Month_name Weekday coffee_name Time_of_Day avg_price total_units_sold
0 0 Apr Mon Americano Morning 28.655 114.62
1 1 Apr Mon Americano Afternoon 27.920 27.92
2 2 Apr Mon Americano Night 28.900 57.80
3 3 Apr Mon Americano with Milk Morning 33.604 168.02
4 4 Apr Mon Americano with Milk Afternoon 32.820 32.82
... ... ... ... ... ... ... ...
2011 2011 Sep Sun Hot Chocolate Afternoon NaN 0.00
2012 2012 Sep Sun Hot Chocolate Night 32.820 65.64
2013 2013 Sep Sun Latte Morning 32.820 32.82
2014 2014 Sep Sun Latte Afternoon 34.584 172.92
2015 2015 Sep Sun Latte Night 33.408 167.04

2016 rows × 7 columns

In [192]:
df_2.drop(columns=['index'])
Out[192]:
Month_name Weekday coffee_name Time_of_Day avg_price total_units_sold
0 Apr Mon Americano Morning 28.655 114.62
1 Apr Mon Americano Afternoon 27.920 27.92
2 Apr Mon Americano Night 28.900 57.80
3 Apr Mon Americano with Milk Morning 33.604 168.02
4 Apr Mon Americano with Milk Afternoon 32.820 32.82
... ... ... ... ... ... ...
2011 Sep Sun Hot Chocolate Afternoon NaN 0.00
2012 Sep Sun Hot Chocolate Night 32.820 65.64
2013 Sep Sun Latte Morning 32.820 32.82
2014 Sep Sun Latte Afternoon 34.584 172.92
2015 Sep Sun Latte Night 33.408 167.04

2016 rows × 6 columns

In [194]:
y=df_2['total_units_sold']
coffee_features=['Month_name','Weekday','Time_of_Day','coffee_name','avg_price']
In [196]:
X=df_2[coffee_features]
In [198]:
X.head()
Out[198]:
Month_name Weekday Time_of_Day coffee_name avg_price
0 Apr Mon Morning Americano 28.655
1 Apr Mon Afternoon Americano 27.920
2 Apr Mon Night Americano 28.900
3 Apr Mon Morning Americano with Milk 33.604
4 Apr Mon Afternoon Americano with Milk 32.820
In [214]:
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2016 entries, 0 to 2015
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   Month_name   2016 non-null   object  
 1   Weekday      2016 non-null   category
 2   Time_of_Day  2016 non-null   category
 3   coffee_name  2016 non-null   object  
 4   avg_price    1311 non-null   float64 
dtypes: category(2), float64(1), object(2)
memory usage: 51.8+ KB
In [220]:
string_cols=X.select_dtypes(exclude='float').columns
string_cols
Out[220]:
Index(['Month_name', 'Weekday', 'Time_of_Day', 'coffee_name'], dtype='object')
In [236]:
X_final=pd.get_dummies(X,columns=string_cols)
X_final
Out[236]:
avg_price Month_name_Apr Month_name_Aug Month_name_Dec Month_name_Feb Month_name_Jan Month_name_Jul Month_name_Jun Month_name_Mar Month_name_May ... Time_of_Day_Afternoon Time_of_Day_Night coffee_name_Americano coffee_name_Americano with Milk coffee_name_Cappuccino coffee_name_Cocoa coffee_name_Cortado coffee_name_Espresso coffee_name_Hot Chocolate coffee_name_Latte
0 28.655 True False False False False False False False False ... False False True False False False False False False False
1 27.920 True False False False False False False False False ... True False True False False False False False False False
2 28.900 True False False False False False False False False ... False True True False False False False False False False
3 33.604 True False False False False False False False False ... False False False True False False False False False False
4 32.820 True False False False False False False False False ... True False False True False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2011 NaN False False False False False False False False False ... True False False False False False False False True False
2012 32.820 False False False False False False False False False ... False True False False False False False False True False
2013 32.820 False False False False False False False False False ... False False False False False False False False False True
2014 34.584 False False False False False False False False False ... True False False False False False False False False True
2015 33.408 False False False False False False False False False ... False True False False False False False False False True

2016 rows × 31 columns

In [252]:
train_X, val_X, train_y, val_y = train_test_split(X_final, y, random_state = 0)
In [248]:
print(X_final.index)
RangeIndex(start=0, stop=2016, step=1)
In [246]:
print(y.index)
RangeIndex(start=0, stop=2016, step=1)
In [420]:
model=DecisionTreeRegressor(max_depth=7,min_samples_split=10,random_state=1)


model.fit(train_X,train_y)
Out[420]:
DecisionTreeRegressor(max_depth=7, min_samples_split=10, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeRegressor(max_depth=7, min_samples_split=10, random_state=1)
In [422]:
predictions=model.predict(val_X)
print(predictions[:10])
print(val_y.head(10).values)
[ 51.8640678  120.85714286   0.           0.         146.58
   0.          51.8640678   30.08571429  91.36468085  51.8640678 ]
[ 27.92 113.16   0.     0.    83.76   0.    51.92  42.12  35.76  25.96]
In [334]:
from sklearn.metrics import mean_absolute_error, r2_score
In [424]:
mae=mean_absolute_error(predictions,val_y)
r2=r2_score(predictions,val_y)
print(f"mae{mae:.2f}")
print(f"R2{r2:.3f}")
mae29.20
R20.251
In [426]:
importance=pd.DataFrame({
    'Feature':train_X.columns,
    'Importance':model.feature_importances_}).sort_values(by='Importance',ascending=False)
print(importance.head(10))
                            Feature  Importance
0                         avg_price    0.654262
23            coffee_name_Americano    0.057551
30                coffee_name_Latte    0.038540
24  coffee_name_Americano with Milk    0.034684
8                    Month_name_Mar    0.032512
22                Time_of_Day_Night    0.029497
4                    Month_name_Feb    0.025779
13                      Weekday_Mon    0.020002
11                   Month_name_Oct    0.017722
18                      Weekday_Sat    0.017041
  • The highest impact for sale have price, type of coffee and month of a year.
In [406]:
y_train_pred=model.predict(train_X)
y_test_pred=model.predict(val_X)
In [428]:
plt.figure(figsize=(8,6))
plt.scatter(train_y,y_train_pred,label='Train')
plt.scatter(val_y,y_test_pred,label='Test')
plt.plot([y.min(),y.max()],[y.min(),y.max()],'r--')
plt.xlabel('Real sales')
plt.ylabel('Predict sales')
plt.legend()
plt.show()
No description has been provided for this image